This paper develops nonparametric estimation for discrete choice models based on the mixed
multinomial logit (MMNL) model. It has been shown that MMNL models encompass all discrete
choice models derived under the assumption of random utility maximization, subject to the
identification of an unknown distribution G. Noting the mixture model description of the MMNL, we
employ a Bayesian nonparametric approach, using nonparametric priors on the unknown mixing
distribution G, to estimate choice probabilities. We provide theoretical support for
the use of the proposed methodology by investigating consistency of the posterior distribution
for a general nonparametric prior on the mixing distribution. Consistency is defined according
to an $L_1$-type distance on the space of choice probabilities and is achieved by extending to a
regression model framework a recent approach to strong consistency based on the summability
of square roots of prior probabilities. Moving to estimation, slightly different techniques for
non-panel and panel data models are discussed. For practical implementation, we describe efficient
and relatively easy-to-use blocked Gibbs sampling procedures. These procedures are based on
approximations of the random probability measure by classes of finite stick-breaking processes.
A simulation study is also performed to investigate the performance of the proposed methods.

Keywords: Bayesian consistency; blocked Gibbs sampler; discrete choice models; mixed 
multinomial logit; random probability measures; stick-breaking priors 

1. Introduction 

Discrete choice models arise naturally in many fields of application, including marketing
and transportation science. Such choice models are based on the neoclassical
economic theory of random utility maximization (RUM). Given a finite set of choices
$C = \{1, \ldots, J\}$, it is assumed that each individual has a utility function

\[ U_j = x_j'\beta + \varepsilon_j \quad \text{for } j \in C. \]

The values $x = (x_1, \ldots, x_J)$ are observed covariates, where $x_j \in \mathbb{R}^d$ denote the covariates
associated with each choice $\{j\} \subseteq C$, the coefficient $\beta$ is an unknown (preference) vector
in $\mathbb{R}^d$ and $(\varepsilon_1, \ldots, \varepsilon_J)$ are random terms. Suppose that all $U_j$ are distinct and that
the individual makes a choice $\{j\}$ if and only if $U_j > U_l \ \forall l \neq j$. The introduction of the
random error terms $\varepsilon_j$ represents a departure from classical economic utility models. The
random errors account for the discrepancy between the actual utility, which is known
by the chooser, and that which is deduced by the experimenter who observes x and
the choice made by the individual. Hence, the deterministic statement of choice $\{j\}$ is
replaced by the probability of choosing $\{j\}$, that is, $P\{U_j > U_l \ \forall l \neq j\}$. The analysis of
such a model depends on the specifications of the errors. McFadden (1974) shows that
the specification of independent Gumbel error terms leads to the tractable multinomial
logit (MNL) model. This representation is written as

\[ P(\{j\}|\beta, x) = \frac{\exp\{x_j'\beta\}}{\sum_{l \in C} \exp\{x_l'\beta\}} \quad \text{for } j \in C. \]
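
As a minimal numerical sketch of the MNL formula, the snippet below evaluates the choice probabilities for one choice set. The function name `mnl_probabilities`, the row-per-alternative layout of the covariates and the demonstration values are assumptions made for illustration; they are not part of the original text.

```python
import numpy as np

def mnl_probabilities(X, beta):
    """Multinomial logit probabilities for a single choice set.

    X    : (J, d) array; row j holds the covariates x_j of alternative j
    beta : (d,) preference vector
    """
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    v = X @ beta          # systematic utilities x_j' beta
    v = v - v.max()       # shift for numerical stability; the ratios are unchanged
    w = np.exp(v)
    return w / w.sum()

# Illustrative inputs (assumed values)
X_demo = np.array([[1.0, -0.9], [1.0, 0.2], [1.0, 0.9]])
beta_demo = np.array([-5.0, 5.0])
p_demo = mnl_probabilities(X_demo, beta_demo)
```

Subtracting the maximum utility before exponentiating leaves the probabilities unchanged but avoids overflow when the utilities are large in absolute value.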

The MNL possesses the property of independence from irrelevant alternatives (IIA),
which makes it inappropriate in many situations. The probit and the generalized
extreme value models, which do not exhibit the IIA property and are models derived from
dependent error structures, have been proposed as alternatives to the MNL. A drawback
of the aforementioned procedures is that they are not robust against model
misspecification.

The mixed multinomial logit (MMNL) model, first introduced by Cardell and Dunbar
(1980), emerges as potentially the most attractive model. The book by Train (2003)
includes a detailed discussion of this model. The general MMNL choice probabilities are
defined by mixing an MNL model over a mixing distribution G. For a set of covariates
x, the MMNL model is written as

\[ P(\{j\}|G, x) = \int \frac{\exp\{x_j'\beta\}}{\sum_{l \in C} \exp\{x_l'\beta\}}\, G(d\beta) \quad \text{for } j \in C. \tag{1} \]

McFadden and Train (2000) establish the important result that, in theory, all RUM
models can be captured by correct specification of G. Thus, a robust approach amounts to
being able to employ statistical estimation methods based on a nonparametric assumption
on G. However, statistical techniques have only been developed for the case where G is
given a parametric form. The most popular model is when G is specified to be multivariate
normal with unknown mean $\mu$ and covariance matrix $\tau$:

\[ P(\{j\}|\mu, \tau, x) = \int \frac{\exp\{x_j'\beta\}}{\sum_{l \in C} \exp\{x_l'\beta\}}\, \phi(\beta|\mu, \tau)\, d\beta \quad \text{for } j \in C, \tag{2} \]

where $\phi(\beta|\mu, \tau)$ represents a multivariate normal density with parameters $\mu$ and $\tau$. We
shall refer to this as a Gaussian mixed logit (GML) model. Here, based on a sample
of size n, one estimates the choice probabilities by estimating $\mu$ and $\tau$. Applications
and discussions are found in, among others, Bhat (1998), Brownstone and Train (1999),
Erdem (1996), Srinivasan and Mahmassani (2005) and Walker, Ben-Akiva and Bolduc
(2007). Additionally, Dubé et al. (2002) provide a discussion focused on applications to
marketing. The GML model is popular since it is flexible and relatively easy to estimate
via simulated maximum likelihood techniques or via Bayesian MCMC procedures. Other
choices for G include the lognormal and uniform distributions. Train (2003) discusses the
merits and possible drawbacks of Bayesian MCMC procedures versus simulated maximum
likelihood procedures for various choices of G. However, despite the attractive features
of the GML, it does not encompass all RUM models; hence, it is not robust against
misspecification.
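
The integral in (2) has no closed form, which is why simulation is used in practice. The following is a minimal Monte Carlo sketch of that idea, not the estimation procedure of any of the cited works: it averages MNL probabilities over draws of $\beta$ from the normal mixing density. The function name `gml_probabilities`, the number of draws and the seed are illustrative assumptions.

```python
import numpy as np

def gml_probabilities(X, mu, tau, n_draws=10_000, seed=0):
    """Monte Carlo approximation of the Gaussian mixed logit probabilities in (2).

    X   : (J, d) array of alternative-specific covariates
    mu  : (d,) mean of the mixing normal distribution
    tau : (d, d) covariance matrix of the mixing normal distribution
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    betas = rng.multivariate_normal(mu, tau, size=n_draws)   # (n_draws, d)
    v = betas @ X.T                                          # utilities, (n_draws, J)
    v -= v.max(axis=1, keepdims=True)                        # numerical stability
    w = np.exp(v)
    logit = w / w.sum(axis=1, keepdims=True)                 # MNL probabilities per draw
    return logit.mean(axis=0)                                # average over the draws
```

When `tau` is close to the zero matrix the mixing distribution degenerates at `mu`, and the output approaches the plain MNL probabilities evaluated at `mu`.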

In this article, we develop a nonparametric Bayesian method for the estimation of the
choice probabilities and we prove consistency of the posterior distribution. The idea is
to model the mixing distribution G via a random probability measure in order to fully
exploit the flexibility of the MMNL model. Many nonparametric priors are currently
available for modeling G, such as stick-breaking priors, normalized random measures with
independent increments and Dirichlet process mixtures. We establish consistency of the
posterior distribution of G under neat sufficient conditions which are readily verifiable
for all of these nonparametric priors. Consistency is defined according to an $L_1$-type
distance on the space of choice probabilities by exploiting the square root approach
to strong consistency of Walker (2003a, 2004). We essentially show that the Bayesian
MMNL model is consistent if the prior on G has the true mixing distribution in its weak
support and satisfies a mild condition on the tails of the prior predictive distribution.
We then move to estimation and divide our discussion into methods for non-panel and
panel data. Specifically, for non-panel data models, we use, as a prior for G, a mixture of
Dirichlet processes. Methods for panel data instead involve a Dirichlet mixture of normal
densities. For practical implementation, we describe efficient and relatively easy-to-use
blocked Gibbs sampling procedures, developed in Ishwaran and Zarepour (2000) and
Ishwaran and James (2001).

The rest of the paper is organized as follows. In Section 2, we describe the Bayesian
nonparametric approach by placing a nonparametric prior on the mixing distribution
and present the consistency result for the posterior distribution of G. In Section 3, we
show how to implement a blocked Gibbs sampler for drawing inference when a discrete
nonparametric prior is used. Section 4 deals with panel data with similar Bayesian
nonparametric methods, where we define a class of priors for G that preserves the distinct
nature of individual preferences and specialize the blocked Gibbs sampler to this setting.
In Section 5, we provide an illustrative simulation study which shows the flexibility and
good performance of our procedures. Finally, in Section 6, we provide a detailed proof of
consistency.

2. Bayesian MMNL models

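A Bayesian nonparametric MMNL model is specified by placing a nonparametric prior
on the mixing distribution G in (1):

\[ P(\{j\}|G, x) = \int \frac{\exp\{x_j'\beta\}}{\sum_{l \in C} \exp\{x_l'\beta\}}\, G(d\beta) \quad \text{for } j \in C. \tag{3} \]

Here, G denotes a random probability measure which takes values over the space $\mathbb{P}$ of
probability measures on $\mathbb{R}^d$, the former endowed with the weak topology. The nonparametric
distribution of G is denoted by $\mathcal{P}$. Model (3) can be equivalently expressed in
hierarchical form as

\[
\begin{aligned}
Y_i|\beta_i &\overset{ind}{\sim} \frac{\exp\{x_{iY_i}'\beta_i\}}{\sum_{j \in C} \exp\{x_{ij}'\beta_i\}} && \text{for } i = 1, \ldots, n \text{ and } Y_i \in C, \\
\beta_i|G &\overset{iid}{\sim} G && \text{for } i = 1, \ldots, n, \\
G &\sim \mathcal{P},
\end{aligned} \tag{4}
\]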



with $x_i = (x_{i1}, \ldots, x_{iJ})$ the covariates and $Y_i$ the choice observed for individual i.

One can choose G to be a Dirichlet process (Ferguson (1973)), although there currently
exist other nonparametric priors that can be used, like stick-breaking priors (Ishwaran
and James (2001)) and normalized random measures with independent increments
(NRMI) (Regazzini, Lijoi and Prünster (2003)). All of these priors select discrete distributions
almost surely (a.s.), whereas random probability measures whose support contains
continuous distributions can be obtained by using a Dirichlet process mixture of densities,
in the spirit of Lo (1984). An important role in the sequel will be played by the prior
predictive distribution of G, denoted by H, which is an element of $\mathbb{P}$ and is defined by

\[ H(B) := E[G(B)] \tag{5} \]

for all Borel sets B of $\mathbb{R}^d$, where $E(\cdot)$ denotes expectation. In the next section, we show
that an essential condition for consistency of the posterior distribution is expressed in
terms of H. This yields an easy-to-use criterion for the choice of the prior for G as H is
readily obtained for all of the nonparametric priors listed above. Furthermore, one can
embed a parametric model, such as the GML, within the nonparametric framework via
a suitable specification of the distribution H.



2.1. Posterior consistency 

Bayesian consistency deals with the asymptotic behavior of posterior distributions with
respect to repeated sampling. The problem can be set in general terms as follows: suppose
the existence of a "true" unknown distribution $P_0$ that generates the data, then check
whether the posterior accumulates in suitably defined neighborhoods of $P_0$. There exist
two main approaches to the study of strong consistency, that is, consistency when the
neighborhood of $P_0$ is defined according to the Hellinger metric on the space of density
functions. One is based on the metric entropy of the parameter space and was set forth in
Barron, Schervish and Wasserman (1999) and Ghosal, Ghosh and Ramamoorthi (1999).
The second approach was introduced by Walker (2003a, 2004) and has more of a Bayesian
flavor, in the sense that it relies on the summability of square roots of prior probabilities.
For discussion, the reader is referred to Wasserman (1998), Walker, Lijoi and Prünster
(2005) and Choudhuri, Ghosal and Roy (2005). Strong consistency in mixture models for
density estimation is addressed by Ghosal, Ghosh and Ramamoorthi (1999) and Lijoi,
Prünster and Walker (2005), by using the metric entropy approach and the square root
approach, respectively. As for the non-identically distributed case, we mention Choi and
Schervish (2007) and Ghosal and Roy (2006), both of which follow the metric entropy
approach. The square root approach is adopted by Walker (2003b) for nonparametric
regression models and by Ghosal and Tang (2006) for the estimation of transition densities
in the context of Markov processes.

We address the issue of consistency for the MMNL model (3) by exploiting the square
root approach of Walker and its variation proposed in Lijoi, Prünster and Walker (2005)
which makes use of metric entropy in an instrumental way. We assume the existence of
a $G_0 \in \mathbb{P}$ such that the true distribution of Y given X = x is given by

\[ P_0(\{j\}|x) = \int_{\mathbb{R}^d} \frac{\exp(x_j'\beta)}{\sum_{l \in C} \exp(x_l'\beta)}\, G_0(d\beta). \]

The covariates $X_1, X_2, \ldots$ are taken as independent draws from a common distribution $M(dx)$
which is supported on $\mathcal{X} \subset \mathbb{R}^{Jd}$. The distribution of an infinite sequence $(Y_i, X_i)_{i \geq 1}$ will
then be denoted by $P^{\infty}_{(G_0,M)}$. Finally, let $\mathcal{P}_n$ denote the posterior distribution of G given
$(Y_1, X_1), \ldots, (Y_n, X_n)$; see also equation (19) in Section 6. In the sequel, we take the
covariate distribution M to be a fixed quantity so that the posterior distribution does
not depend on the specific form of M. Note, however, that the posterior evaluation is
also not affected when M is considered as a parameter with an independent prior since
it is reasonable to assume that the choice probabilities are unrelated to M.

We give conditions on $G_0$ and the prior predictive distribution of G such that the
posterior distribution $\mathcal{P}_n$ concentrates all probability mass in neighborhoods of $G_0$
defined according to strong consistency of choice probabilities. To this end, we look at
the vector of choice probabilities as a vector-valued function $q: \mathcal{X} \to \Delta$, where $\Delta$ is the
J-dimensional probability simplex. We define

\[ q(x; G) = [P(\{1\}|G, x), \ldots, P(\{J\}|G, x)] \tag{6} \]

for any $G \in \mathbb{P}$. On the space $\mathcal{Q} = \{q(\cdot\,; G): G \in \mathbb{P}\}$, we define the $L_1$-type distance

\[ d(q_1, q_2) = \int_{\mathcal{X}} |q_1(x) - q_2(x)|\, M(dx), \tag{7} \]

where $|\cdot|$ denotes the Euclidean norm in $\Delta$.
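
The distance (7) is an integral with respect to the covariate distribution M, so it can be approximated by averaging over draws from M. The snippet below is a minimal sketch of that approximation; the function name `l1_type_distance` and the representation of the choice-probability functions as Python callables are assumptions made for illustration.

```python
import numpy as np

def l1_type_distance(q1, q2, x_draws):
    """Monte Carlo estimate of d(q1, q2) in (7).

    q1, q2  : callables mapping a covariate array x to a length-J probability vector
    x_draws : iterable of covariate arrays, assumed to be draws from M
    """
    diffs = [
        np.linalg.norm(np.asarray(q1(x), dtype=float) - np.asarray(q2(x), dtype=float))
        for x in x_draws
    ]
    return float(np.mean(diffs))   # average Euclidean norm of the difference
```

The Euclidean norm inside the average mirrors the definition of $|\cdot|$ above; the outer average plays the role of the integral against $M(dx)$.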

Definition 1. $\mathcal{P}$ is consistent at $G_0$ if, for any $\epsilon > 0$,

\[ \mathcal{P}_n\{G: d(q(\cdot\,; G), q(\cdot\,; G_0)) > \epsilon\} \to 0, \quad P^{\infty}_{(G_0,M)}\text{-a.s.} \]

The main result is stated in the following theorem.

Theorem 1. Let $\mathcal{P}$ be a prior on $\mathbb{P}$ with predictive distribution H and $G_0$ be in the weak
support of $\mathcal{P}$. Suppose that $\mathcal{X}$ is a compact subset of $\mathbb{R}^{Jd}$. If

(i) $P_0(\{j\}|x) > 0$ for any $j \in C$ and $x \in \mathcal{X}$;

(ii) $\int_{\mathbb{R}^d} |\beta|\, H(d\beta) < +\infty$,

then $\mathcal{P}$ is consistent at $G_0$.

The compactness of the covariate space is a standard assumption in nonparametric
regression problems. Condition (i) is fairly reasonable since it is guaranteed by a correct
specification of the RUM model: one can always redefine the set of choices or the covariate
space to fulfill this requirement. Moreover, because of the compactness of $\mathcal{X}$, condition
(i) implies that $G_0$ is a proper distribution on $\mathbb{R}^d$, that is, with no masses escaping at
infinity. The verification that $G_0$ belongs to the weak support of $\mathcal{P}$ is then an easy task:
in general, it is sufficient that the prior predictive distribution H has full support on $\mathbb{R}^d$.
Condition (ii) is a mild condition on the tails of H: it is satisfied by any distribution with
tails lighter than the Cauchy distribution.

2.2. Illustration 

It is worth considering condition (ii) in more detail for a variety of Bayesian MMNL
models, obtained from different specifications of $\mathcal{P}$. If G is taken to be a Dirichlet process
with base measure $\alpha = aF$, where $a > 0$ is a constant and $F \in \mathbb{P}$, then F coincides with
H in (5). A larger class of Bayesian MMNL models arises when G is chosen to be a
stick-breaking prior:

\[ G(\cdot) = \sum_{k \geq 1} p_k \delta_{Z_k}(\cdot), \tag{8} \]

where the $p_k$ are positive random probabilities chosen to be independent of $Z_k$ and
such that $\sum_{k \geq 1} p_k = 1$ a.s. The $Z_k$ are random locations taken as independent draws
from some non-atomic distribution F in $\mathbb{P}$. What characterizes a stick-breaking prior
is that the random weights are expressible as $p_k = V_k \prod_{i=1}^{k-1} (1 - V_i)$, where the $V_k$ are
independent beta-distributed random variables of parameters $a_k, b_k > 0$; we write $V_k \sim$
beta$(a_k, b_k)$. Examples of random probability measures in this class are given in Ishwaran
and James (2001); see also Pitman and Yor (1997) and Ishwaran and Zarepour (2000).
They represent extensions of the Dirichlet process, which has $a_k = 1$ and $b_k = a \ \forall k$, and
they all have in common that the prior predictive distribution H coincides with F.
The class of NRMI is another valid choice for $\mathcal{P}$. Specifically, one can take $G(\cdot) = \mu(\cdot)/\mu(\mathbb{R}^d)$,
where $\mu$ is a completely random measure with Poisson intensity measure
$\nu(dv, dz) = \rho(dv|z)\alpha(dz)$ on $(0, +\infty) \times \mathbb{R}^d$. Here, $\rho(\cdot|z)$ is a Lévy density on $(0, +\infty)$ for
any z and $\alpha$ is a finite measure on $\mathbb{R}^d$ such that $\psi(u) := \int_{\mathbb{R}^d \times \mathbb{R}^+} (1 - e^{-uv})\rho(dv|z)\alpha(dz) <
\infty$, which is needed to guarantee that $\mu(\mathbb{R}^d) < \infty$ a.s. It can be shown that

\[ H(B) = \int_B \int_0^{\infty} e^{-\psi(u)} \left\{ \int_0^{\infty} e^{-uv}\, v\, \rho(dv|z) \right\} du\, \alpha(dz) \]

for any Borel set B of $\mathbb{R}^d$; see also James, Lijoi and Prünster (2009). When $\rho(dv|z) = \rho(dv)$
for each z (homogeneous case), the prior predictive distribution reduces to

\[ H(B) = \frac{\alpha(B)}{\alpha(\mathbb{R}^d)} \quad \text{for any Borel } B \subset \mathbb{R}^d. \tag{9} \]

The homogeneous NRMI includes, as a special case, the Dirichlet process and belongs,
together with the stick-breaking priors, to the class of species sampling models for which
(9) holds for some finite measure $\alpha$. Note that all of the nonparametric priors belonging
to this class allow an easy verification of condition (ii).

The specification of the nonparametric prior in terms of a base measure $\alpha$, as in (9),
allows more flexibility to be introduced via an additional level in the hierarchical structure
(4). If we let the base measure be indexed by a parameter $\theta$, say $\alpha_\theta$, and $\theta$ be random
with probability density $\pi(\theta)$ on some Euclidean space $\Theta$, then we obtain a mixture of
Dirichlet processes in the spirit of Antoniak (1974). Condition (ii) must then be verified
for the convolution

\[ H(B) = \int_{\Theta} \int_B H_\theta(dz)\, \pi(\theta)\, d\theta, \quad \text{where } H_\theta(dz) = \frac{\alpha_\theta(dz)}{\alpha_\theta(\mathbb{R}^d)}. \tag{10} \]

It is quite straightforward to check that condition (ii) holds for the mixture of Dirichlet
processes implemented in the analysis of non-panel data in Section 3.

Finally, consider the case of Dirichlet process mixture models of Lo (1984), where G is
absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$ with random density
function specified as $\int_{\Theta} K(\beta, \theta)\Pi(d\theta)$. Here, $K(\beta, \theta)$ is a non-negative kernel defined on
$\mathbb{R}^d \times \Theta$ such that, for each $\theta \in \Theta$, $\int_{\mathbb{R}^d} K(z, \theta)\, dz = 1$, while $\Pi$ is a Dirichlet process prior
with base measure aF and F a probability measure on $\Theta$. The distribution H is then
absolutely continuous and is given by

\[ H(B) = \int_B \int_{\Theta} K(z, \theta)\, F(d\theta)\, dz. \]

As in (10), verifying condition (ii) requires a study of the tail properties of a convolution,
this time of $K(z, \theta)$ with respect to $F(d\theta)$. In the analysis of panel data (see Section 4),
we adopt a Dirichlet mixture model as a continuous nonparametric prior for G where the
verification of condition (ii) can be readily established.



3. Implementation for non-panel data 

Assume that we have a single observation for each individual and that we want to account
for the possibility of ties among different individuals' preferences. Therefore, we use a
discrete nonparametric prior for the mixing distribution. Take G to be a Dirichlet process
with base measure aF and denote its law by $\mathcal{P}(dG|aF)$ (although the treatment can be
easily extended to any other stick-breaking prior). Representation (8) then holds with
random probabilities $p_1, p_2, \ldots$ at locations $Z_1, Z_2, \ldots$, which are i.i.d. draws from F.
This translates into a Bayesian model for the MMNL as

\[ P(\{j\}|G, x) = \sum_{k \geq 1} p_k \frac{\exp\{x_j'Z_k\}}{\sum_{l \in C} \exp\{x_l'Z_k\}} \quad \text{for } j \in C. \tag{11} \]

One can then center G on a parametric model like the GML in (2) by taking F to have
normal density $\phi(\beta|\mu, \tau)$. In a parametric Bayesian framework, by placing priors on $\mu, \tau$,
one is able to get posterior estimates of $\mu, \tau$, but inference is restricted to the assumption
of the GML model. The flexibility of the Bayesian nonparametric approach allows one
to choose F based on convenience and ease of use and to utilize, for instance, the attractive
features of GML models while still maintaining the robustness of a nonparametric
approach.

In the case of the Dirichlet process, the parameters associated with F, for instance, $\mu$
and $\tau$, are considered fixed. As observed in Section 2, one can introduce more flexibility
in the model by treating such parameters as random. Specifying $\theta = (\mu, \tau)$, $F_\theta(d\beta)$ to
have density $\phi(\beta|\theta)\, d\beta$ and $\pi(\theta)$ to be the density function for $\theta$, the law of G is given by
the mixture $\int_{\Theta} \mathcal{P}(dG|aF_\theta)\pi(d\theta)$. Equivalently, using (8), a mixture of Dirichlet processes
is defined by specifying each $Z_k|\theta$ to be i.i.d. $F_\theta$. Note that, conditional on $\theta$, a prior
guess for the choice probabilities is

\[ E[P(\{j\}|G, x)|\theta] = \int \frac{\exp\{x_j'\beta\}}{\sum_{l \in C} \exp\{x_l'\beta\}}\, F_\theta(d\beta) \quad \text{for } j \in C. \tag{12} \]

By the properties of the Dirichlet process, the prediction rule for the choice probabilities
given $\beta_1, \ldots, \beta_n$ is given by

\[ E[P(\{j\}|G, x)|\theta, \beta_1, \ldots, \beta_n] = \frac{a}{a + n} P(\{j\}|F_\theta, x) + \frac{1}{a + n} \sum_{i=1}^{n} \frac{\exp\{x_j'\beta_i\}}{\sum_{l \in C} \exp\{x_l'\beta_i\}}, \tag{13} \]

where $P(\{j\}|F_\theta, x) := E[P(\{j\}|G, x)|\theta]$ is given in (12) with a notation consistent with
(1). However, the variables $\beta_i$ are not observable and hence one needs to implement
computational procedures to draw from their posterior distribution.
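
To make the prediction rule (13) concrete, the following minimal sketch combines a prior guess with the MNL probabilities evaluated at given values of $\beta_1, \ldots, \beta_n$. The function name `prediction_rule` is hypothetical, and the prior guess (12) is assumed to be supplied by the caller, for instance from a separate Monte Carlo approximation.

```python
import numpy as np

def prediction_rule(X, betas, a, prior_guess):
    """Evaluate the right-hand side of (13).

    X           : (J, d) covariates of the alternatives
    betas       : (n, d) array of preference vectors beta_1, ..., beta_n
    a           : total mass of the Dirichlet process (a > 0)
    prior_guess : (J,) vector P({j}|F_theta, x) from (12), assumed precomputed
    """
    X = np.asarray(X, dtype=float)
    betas = np.atleast_2d(np.asarray(betas, dtype=float))
    n = betas.shape[0]
    v = betas @ X.T                              # (n, J) utilities
    v -= v.max(axis=1, keepdims=True)            # numerical stability
    w = np.exp(v)
    logit = w / w.sum(axis=1, keepdims=True)     # MNL probabilities at each beta_i
    return (a * np.asarray(prior_guess, dtype=float) + logit.sum(axis=0)) / (a + n)
```

The weight $a/(a+n)$ on the prior guess shrinks as n grows, so the rule moves from the parametric centering toward the empirical average of the logit terms.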

In this framework, a reasonable algorithm to use is the blocked Gibbs sampler developed
in Ishwaran and Zarepour (2000) and Ishwaran and James (2001). Indeed, since
the multinomial logistic kernel does not form a conjugate pair for $\beta$, marginal algorithms
suffer from slow convergence, although strategies for overcoming this problem can be
found in MacEachern and Müller (1998).

3.1. Blocked Gibbs algorithm

In this section, we discuss how to implement a blocked Gibbs sampling algorithm for
drawing inference on a nonparametric hierarchical model with the structure

\[
\begin{aligned}
Y_i|\beta_i &\overset{ind}{\sim} L(Y_i, \beta_i) && \text{for } i = 1, \ldots, n \text{ and } Y_i \in C, \\
\beta_i|G &\overset{iid}{\sim} G && \text{for } i = 1, \ldots, n, \\
G|\theta &\sim \mathcal{P}(dG|aF_\theta), \\
\theta &\sim \pi(d\theta),
\end{aligned} \tag{14}
\]

where $L(Y_i, \beta) = \exp\{x_{iY_i}'\beta\}/\sum_{j \in C} \exp\{x_{ij}'\beta\}$ is the probability for $Y_i$ conditional on $\beta$.
The blocked Gibbs sampler utilizes the fact that a truncated Dirichlet process, discussed
in Ishwaran and Zarepour (2000) and Ishwaran and James (2001), serves as a good
approximation to the random probability measure $G|\theta$ in (14). We replace the conditional
law $\mathcal{P}(dG|aF_\theta)$ with the law of the random probability measure

\[ G(\cdot) = \sum_{k=1}^{N} p_k \delta_{Z_k}(\cdot), \quad 1 \leq N < \infty, \tag{15} \]

where $Z_k|\theta$ are i.i.d. $F_\theta$ and the random probabilities $p_1, \ldots, p_N$ are defined by the
stick-breaking construction

\[ p_1 = V_1 \quad \text{and} \quad p_k = (1 - V_1) \cdots (1 - V_{k-1}) V_k, \quad k = 2, \ldots, N, \tag{16} \]

with $V_1, V_2, \ldots, V_{N-1}$ i.i.d. beta$(1, a)$ and $V_N = 1$, which ensures that $\sum_{k=1}^{N} p_k = 1$. The
law of $G|\theta$ in (15) is referred to as a truncated Dirichlet process and will be denoted
$\mathcal{P}^N(dG|aF_\theta)$. Moreover, the limit as $N \to \infty$ will converge to a random probability
measure with law $\mathcal{P}(dG|aF_\theta)$. Indeed, the method yields an accurate approximation
of the Dirichlet process for moderately large N since the truncation is exponentially
accurate. Theorem 2 in Ishwaran and James (2001) provides an $L_1$-error bound for the
approximation of the conditional density of $Y = (Y_1, \ldots, Y_n)$ given $\theta$. Let

\[ \mu^N(Y|\theta) = \int \left[ \prod_{i=1}^{n} \int_{\mathbb{R}^d} L(Y_i, \beta_i)\, G(d\beta_i) \right] \mathcal{P}^N(dG|aF_\theta) \]

and $\mu(Y|\theta)$ be its limit under the prior $\mathcal{P}(dG|aF_\theta)$. One then has

\[ \|\mu^N - \mu\|_1 := \int |\mu^N(Y|\theta) - \mu(Y|\theta)|\, dY \approx 4n e^{-(N-1)/a}, \]

where the integral above is considered over the counting measure on the n-fold product
space $C^n$. Moreover, Corollary 1 in Ishwaran and James (2002) can be used to show that
the truncated Dirichlet process also leads to asymptotic approximations to the posterior
that are exponentially accurate.
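
The truncated stick-breaking construction (15)-(16) and the size of the truncation error level $4n e^{-(N-1)/a}$ can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the function names `truncated_stick_breaking` and `truncation_error_level` are hypothetical, and the values of n, a and N in the test are chosen only for demonstration.

```python
import numpy as np

def truncated_stick_breaking(a, N, rng):
    """Draw (p_1, ..., p_N) as in (16): V_k ~ beta(1, a) for k < N and V_N = 1."""
    V = rng.beta(1.0, a, size=N)
    V[-1] = 1.0                                   # forces the weights to sum to one
    # remaining[k] = prod_{i<k} (1 - V_i): stick length left before the k-th break
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining

def truncation_error_level(n, a, N):
    """Evaluate the quantity 4 n exp(-(N - 1) / a) for sample size n and truncation N."""
    return 4.0 * n * np.exp(-(N - 1) / a)
```

Because the error level decays exponentially in N, a truncation level of a few dozen atoms is already far below any practical tolerance when a is of order one, which is the sense in which a moderately large N suffices.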

The key to working with random probability measures like (15) is that it allows blocked
updates to be performed for $p = (p_1, \ldots, p_N)$ and $Z = (Z_1, \ldots, Z_N)$ by recasting the
hierarchical model (14) completely in terms of random variables. To this aim, define the
classification variables $K = \{K_1, \ldots, K_n\}$ such that, conditional on p, each $K_i$ is
independent with distribution

\[ P\{K_i \in \cdot\,|\,p\} = \sum_{k=1}^{N} p_k \delta_k(\cdot). \]

That is, $P\{K_i = k|p\} = p_k$ for $k = 1, \ldots, N$, so that $K_i$ identifies the $Z_k$ associated with
each $\beta_i$: $\beta_i = Z_{K_i}$. In this setting, a sample $\beta_1, \ldots, \beta_n$ from (15) produces $n_0 \leq \min(n, N)$
distinct values. The blocked Gibbs algorithm is based on sampling $K, p, Z, \theta$ from the
distribution proportional to

\[ \left[ \prod_{i=1}^{n} L(Y_i, Z_{K_i}) \right] \left[ \prod_{i=1}^{n} \sum_{k=1}^{N} p_k \delta_k(dK_i) \right] \pi(p) \left[ \prod_{k=1}^{N} F_\theta(dZ_k) \right] \pi(d\theta), \]

where $\pi(p)$ denotes the distribution of p defined in (16). This augmented likelihood is
an expression of the augmented density when $\mathcal{P}(dG|aF_\theta)$ is replaced by $\mathcal{P}^N(dG|aF_\theta)$.

Before describing the algorithm, we specify choices for $F_\theta$ and $\theta$ which agree with
the GML model. Set $\theta = (\mu, \tau)$ and specify the density of $F_\theta$ to be $\phi(\beta|\mu, \tau)$. Let $\lambda$
denote a positive scalar. We choose a multivariate normal inverse Wishart distribution
for $\mu, \tau$, where, specifically, $\mu|\tau$ is a multivariate normal vector with mean parameter m
and scaled covariance matrix $\lambda^{-1}\tau$ and $\tau$ is drawn from an inverse Wishart distribution
with degrees of freedom $\nu_0$ and scale matrix $S_0$. We denote this distribution for $\mu, \tau$ as
N-IW$(m, \lambda^{-1}\tau, \nu_0, S_0)$. Our specification is similar to that used in Train (2003), Chapter
12, for a parametric GML model for panel data.



Algorithm 1.

1. Conditional draw for K. Independently sample $K_i$ according to $P\{K_i \in \cdot\,|\,p, Z, Y\} =
\sum_{k=1}^{N} p_{k,i}\delta_k(\cdot)$ for $i = 1, \ldots, n$, where

\[ (p_{1,i}, \ldots, p_{N,i}) \propto (p_1 L(Y_i, Z_1), \ldots, p_N L(Y_i, Z_N)). \]

2. Conditional draw for p. $p_1 = V_1^*$, $p_k = (1 - V_1^*) \cdots (1 - V_{k-1}^*)V_k^*$, $k = 2, \ldots, N - 1$
and $V_N^* = 1$, where, if $e_k$ records the number of $K_i$ values which equal k,

\[ V_k^* \overset{ind}{\sim} \text{beta}\left(1 + e_k,\ a + \sum_{l=k+1}^{N} e_l\right), \quad k = 1, \ldots, N - 1. \]

3. Conditional draw for Z. Let $\{K_1^*, \ldots, K_{n_0}^*\}$ denote the unique set of $K_i$ values.
For each $k \notin \{K_1^*, \ldots, K_{n_0}^*\}$, draw $Z_k$ from the prior multivariate normal density
$\phi(Z|\mu, \tau)$. For $j = 1, \ldots, n_0$, draw $Z_{K_j^*} := \beta_j^*$ from the density proportional
to $\phi(\beta_j^*|\mu, \tau)\prod_{\{i: K_i = K_j^*\}} L(Y_i, \beta_j^*)$ using, for example, a standard
Metropolis-Hastings procedure.

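4. Conditional draw for $\theta = (\mu, \tau)$. Conditional on $\tau, K, Z, Y$, draw $\mu$ from a
multivariate normal distribution with parameters

\[ \frac{\lambda m + n_0\bar{\beta}^*_{n_0}}{\lambda + n_0} \quad \text{and} \quad \frac{\tau}{\lambda + n_0}, \]

where $\bar{\beta}^*_{n_0} = n_0^{-1}\sum_{j=1}^{n_0} \beta_j^*$. Conditional on $K, Z, Y$, draw $\tau$ from an inverse Wishart
distribution with parameters

\[ \nu_0 + n_0 \quad \text{and} \quad \frac{\nu_0 S_0 + n_0 S_{n_0} + R(\bar{\beta}^*_{n_0}, m)}{\nu_0 + n_0}, \]

where

\[ S_{n_0} = \frac{1}{n_0}\sum_{j=1}^{n_0} (\beta_j^* - \bar{\beta}^*_{n_0})(\beta_j^* - \bar{\beta}^*_{n_0})' \quad \text{and} \quad R(\bar{\beta}^*_{n_0}, m) = \frac{\lambda n_0}{\lambda + n_0}(\bar{\beta}^*_{n_0} - m)(\bar{\beta}^*_{n_0} - m)'. \]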
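
Steps 1 and 2 of Algorithm 1 can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names `draw_K` and `draw_p` are hypothetical, the matrix `loglik` of values $\log L(Y_i, Z_k)$ is assumed to be precomputed, and the labels are 0-based rather than 1-based.

```python
import numpy as np

def draw_K(p, loglik, rng):
    """Step 1: sample K_i with probabilities proportional to p_k * L(Y_i, Z_k).

    p      : (N,) current weights
    loglik : (n, N) array with entries log L(Y_i, Z_k), assumed precomputed
    """
    with np.errstate(divide="ignore"):            # tolerate log(0) for empty weights
        logw = np.log(p)[None, :] + loglik
    logw -= logw.max(axis=1, keepdims=True)       # stabilise before exponentiating
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    n, N = w.shape
    return np.array([rng.choice(N, p=w[i]) for i in range(n)])

def draw_p(K, a, N, rng):
    """Step 2: V_k* ~ beta(1 + e_k, a + sum_{l>k} e_l) for k < N, and V_N* = 1."""
    e = np.bincount(K, minlength=N)               # e_k = number of K_i equal to k
    tail = np.cumsum(e[::-1])[::-1] - e           # sum of e_l over l > k
    V = np.ones(N)
    V[:-1] = rng.beta(1.0 + e[:-1], a + tail[:-1])
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining
```

Both updates are vectorised over the N atoms, which is what makes the blocked sampler convenient: p and Z are refreshed jointly given K, rather than one observation at a time.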

Notice that, when $n_0 = 1$, Steps 3 and 4 reduce to the MCMC steps for a parametric
Bayesian model. Iterating the steps above produces a draw from the distribution
$Z, K, p, \theta|Y$. Thus, each iteration m defines a probability measure

\[ G^{(m)}(\cdot) = \sum_{k=1}^{N} p_k^{(m)}\delta_{Z_k^{(m)}}(\cdot), \]

which eventually approximates draws from the posterior distribution
of $G|Y$. Consequently, one can approximate the posterior distributional properties of the
choice probabilities $P(\{j\}|G, x)$ by constructing (iteratively)

\[ P(\{j\}|G^{(m)}, x) = \sum_{k=1}^{N} p_k^{(m)} \frac{\exp\{x_j'Z_k^{(m)}\}}{\sum_{l \in C} \exp\{x_l'Z_k^{(m)}\}}; \]

see (11). For instance, a histogram of the $P(\{j\}|G^{(m)}, x)$, for $m = 1, \ldots, M$, approximates
the posterior distribution. An approximation to the posterior mean $E[P(\{j\}|G, x)|Y]$
is obtained by $M^{-1}\sum_{m=1}^{M} P(\{j\}|G^{(m)}, x)$ or, alternatively, by

\[ \hat{P}(\{j\}|x) := \frac{1}{M}\sum_{m=1}^{M} E[P(\{j\}|G, x)|\theta^{(m)}, \beta_1^{(m)}, \ldots, \beta_n^{(m)}], \tag{17} \]

where $E[P(\{j\}|G, x)|\theta, \beta_1, \ldots, \beta_n]$ is given in (13) and $\beta_i^{(m)} = Z^{(m)}_{K_i^{(m)}}$.
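
Given stored draws of the weights and atoms from the sampler, the posterior summaries described above can be sketched as follows. The function names `choice_probs_given_G` and `posterior_summary`, and the representation of the MCMC output as a list of `(p, Z)` pairs, are assumptions made for illustration.

```python
import numpy as np

def choice_probs_given_G(X, p, Z):
    """P({j}|G, x) for a discrete G with weights p (N,) and atoms Z (N, d)."""
    v = Z @ np.asarray(X, dtype=float).T          # (N, J) utilities at each atom
    v = v - v.max(axis=1, keepdims=True)          # numerical stability
    w = np.exp(v)
    logit = w / w.sum(axis=1, keepdims=True)      # MNL probabilities at each atom
    return p @ logit                              # mixture over the atoms

def posterior_summary(X, draws, level=0.95):
    """Posterior mean and equal-tailed credible interval from a list of (p, Z) draws."""
    probs = np.array([choice_probs_given_G(X, p, Z) for p, Z in draws])   # (M, J)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(probs, [tail, 1.0 - tail], axis=0)
    return probs.mean(axis=0), lo, hi
```

The rows of `probs` are exactly the values whose histogram approximates the posterior distribution of the choice probabilities; the mean over rows corresponds to the first of the two posterior-mean approximations.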



4. Bayesian modeling for panel data 

The MMNL framework may also be used to model choice probabilities based on panel
data. In the panel data setting, each individual i is observed to make a sequence of
choices at different time points. The random utility for choosing j for individual i in
choice situation t is given by

\[ U_{ijt} = x_{ijt}'\beta_i + \varepsilon_{ijt}, \quad j \in C, \]

for times $t = 1, \ldots, T_i$. The MMNL model can be described as follows [see Train (2003),
Section 6.7]: given $\beta_i$, the probability that a person makes the sequence of choices $Y_i =
\{Y_{i1}, \ldots, Y_{iT_i}\}$ is the product of logit formulae

\[ L(Y_i, \beta_i) = \prod_{t=1}^{T_i} \frac{\exp\{x_{iY_{it}t}'\beta_i\}}{\sum_{j \in C} \exp\{x_{ijt}'\beta_i\}}. \]

The MMNL model is completed by taking the $\beta_i$ to be from a distribution G so that the
unconditional choice probability is specified by

\[ P(Y_i|G, x_i) = \int \prod_{t=1}^{T_i} \frac{\exp\{x_{iY_{it}t}'\beta\}}{\sum_{j \in C} \exp\{x_{ijt}'\beta\}}\, G(d\beta) = \int L(Y_i, \beta)\, G(d\beta), \]

where $x_i = \{x_{ijt}, j \in C, t = 1, \ldots, T_i\}$ denotes the array of covariates associated with the
sequence of choices of individual i. Similarly to the non-panel data setting, we wish to
model G as a random probability measure in a Bayesian framework. While it is possible
to choose G to follow a Dirichlet process, this would result in possible ties among the
individuals' preferences $\beta_i$. In order to preserve the distinct nature of each individual's
preference, we assume that, given G, the $\beta_i$ are i.i.d. with distribution G, where G is a
mixture of multivariate normal distributions with random mixing distribution $\Pi$. That
is, G has random density $\int_{\Theta} \phi(\beta|\mu, \tau)\Pi(d\mu, d\tau)$, where $\Theta = \mathbb{R}^d \times \mathcal{S}$ with $\mathcal{S}$ the space
of covariance matrices. Specifically, we take $\Pi$ to be a Dirichlet process with shape aF,
F a probability measure on $\Theta$. Hence, the Bayesian MMNL model for individual i is
expressible as

\[ P(Y_i|G, x_i) = \int L(Y_i, \beta)\, G(d\beta) = \int_{\mathbb{R}^d} \int_{\Theta} L(Y_i, \beta)\phi(\beta|\mu, \tau)\, \Pi(d\mu, d\tau)\, d\beta. \]

While one may use any choice for F, we take $F(d\mu, d\tau)$ to be the multivariate normal
inverse Wishart distribution N-IW$(m, \lambda^{-1}\tau, S_0, \nu_0)$ described in Section 3.
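
The product-of-logits likelihood $L(Y_i, \beta)$ for one individual is easiest to evaluate on the log scale. The following is a minimal sketch under assumed conventions: the function name `panel_log_likelihood` is hypothetical, the covariates are stored as a `(T, J, d)` array and the observed choices are 0-based integer labels.

```python
import numpy as np

def panel_log_likelihood(X_i, Y_i, beta):
    """log L(Y_i, beta): sum over choice situations of the log logit probability.

    X_i  : (T, J, d) covariates x_{ijt} for individual i
    Y_i  : (T,) integer array of chosen alternatives (0-based)
    beta : (d,) preference vector
    """
    X_i = np.asarray(X_i, dtype=float)
    Y_i = np.asarray(Y_i, dtype=int)
    v = X_i @ np.asarray(beta, dtype=float)       # (T, J) utilities
    m = v.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(v - m).sum(axis=1))   # stable log-sum-exp
    chosen = v[np.arange(Y_i.shape[0]), Y_i]      # utility of the chosen alternative
    return float(np.sum(chosen - log_denom))
```

Working with the log-likelihood avoids underflow when $T_i$ is large, since the product of many probabilities can be very small.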



4.1. Blocked Gibbs algorithm for panel data 

The explicit posterior analysis for the panel data case is quite similar to the non-panel
case. The main difference is that the $(\mu_i, \tau_i)$, $i = 1, \ldots, n$, rather than $\beta_1, \ldots, \beta_n$, are
drawn from the Dirichlet process. Here, we will briefly focus on the relevant data structure
and then proceed to a description of how to implement the blocked Gibbs sampler. The
joint distribution of the augmented data can be expressed using a hierarchical model as
follows:

\[
\begin{aligned}
Y_i|\beta_i &\overset{ind}{\sim} L(Y_i, \beta_i) && \text{for } i = 1, \ldots, n \text{ and } Y_{it} \in C, \\
\beta_i|\mu_i, \tau_i &\overset{ind}{\sim} \phi(\beta_i|\mu_i, \tau_i) && \text{for } i = 1, \ldots, n, \\
\mu_i, \tau_i|\Pi &\overset{iid}{\sim} \Pi && \text{for } i = 1, \ldots, n, \\
\Pi &\sim \mathcal{P}(d\Pi|aF).
\end{aligned} \tag{18}
\]

Similar to the non-panel case, the blocked Gibbs sampler works by using the law $\mathcal{P}^N(d\Pi|aF)$
in place of the law of the Dirichlet process $\mathcal{P}(d\Pi|aF)$. We now sample $(K, p, Z, \beta_1, \ldots, \beta_n)$
from the distribution proportional to

\[ \left[ \prod_{i=1}^{n} L(Y_i, \beta_i)\phi(\beta_i|\mu_{K_i}, \tau_{K_i}) \right] \left[ \prod_{i=1}^{n} \sum_{k=1}^{N} p_k \delta_k(dK_i) \right] \pi(p) \left[ \prod_{k=1}^{N} F(dZ_k) \right]. \]

Here, we use the fact that $(\mu_i, \tau_i) = Z_{K_i}$ for $i = 1, \ldots, n$, with $Z_k = (\mu_k, \tau_k)$. To approximate the posterior
law of various functionals, we cycle through the following steps.

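Algorithm 2.

1. Conditional draw for K. Independently sample $K_i$ according to

\[ P\{K_i \in \cdot\,|\,p, Z, \beta_1, \ldots, \beta_n, Y\} = \sum_{k=1}^{N} p_{k,i}\delta_k(\cdot) \quad \text{for } i = 1, \ldots, n, \]

where $(p_{1,i}, \ldots, p_{N,i}) \propto (p_1\phi(\beta_i|Z_1), \ldots, p_N\phi(\beta_i|Z_N))$.

2. Conditional draw for p. $p_1 = V_1^*$, $p_k = (1 - V_1^*) \cdots (1 - V_{k-1}^*)V_k^*$, $k = 2, \ldots, N - 1$
and $V_N^* = 1$, where, if $e_k$ records the number of $K_i$ values which equal k,

\[ V_k^* \overset{ind}{\sim} \text{beta}\left(1 + e_k,\ a + \sum_{l=k+1}^{N} e_l\right), \quad k = 1, \ldots, N - 1. \]

3. Conditional draw for Z. Let $\{K_1^*, \ldots, K_{n_0}^*\}$ denote the unique set of $K_i$ values. For
each $k \notin \{K_1^*, \ldots, K_{n_0}^*\}$, draw $Z_k = (\mu_k, \tau_k)$ from the prior N-IW$(m, \lambda^{-1}\tau, S_0, \nu_0)$.
For $j = 1, \ldots, n_0$, draw $Z_{K_j^*} := (\mu_j^*, \tau_j^*)$ as follows: (a) conditional on $\tau_j^*, K, \beta_1, \ldots,
\beta_n, Y$, draw $\mu_j^*$ from a multivariate normal distribution with parameters

\[ \frac{\lambda m + e_{K_j^*}\bar{\beta}_j^*}{\lambda + e_{K_j^*}} \quad \text{and} \quad \frac{\tau_j^*}{\lambda + e_{K_j^*}}, \]

where $\bar{\beta}_j^* = (e_{K_j^*})^{-1}\sum_{\{i: K_i = K_j^*\}} \beta_i$; (b) conditional on $K, \beta_1, \ldots, \beta_n, Y$, draw $\tau_j^*$
from an inverse Wishart distribution with parameters

\[ \nu_0 + e_{K_j^*} \quad \text{and} \quad \frac{\nu_0 S_0 + e_{K_j^*}S_j + R(\bar{\beta}_j^*, m)}{\nu_0 + e_{K_j^*}}, \]

where

\[ S_j = \frac{1}{e_{K_j^*}}\sum_{\{i: K_i = K_j^*\}} (\beta_i - \bar{\beta}_j^*)(\beta_i - \bar{\beta}_j^*)' \quad \text{and} \quad R(\bar{\beta}_j^*, m) = \frac{\lambda e_{K_j^*}}{\lambda + e_{K_j^*}}(\bar{\beta}_j^* - m)(\bar{\beta}_j^* - m)'. \]

4. Conditional draw for $\beta_1, \ldots, \beta_n$. For each $j = 1, \ldots, n_0$, independently draw $\beta_i$,
$i \in \{l: K_l = K_j^*\}$, from the density proportional to $L(Y_i, \beta_i)\phi(\beta_i|\mu_j^*, \tau_j^*)$ by using,
for example, a standard Metropolis-Hastings procedure.

When $n_0 = 1$, Steps 3 and 4 equate with a parametric MCMC procedure for panel
data models similar to the algorithm described in Train (2003), Section 12.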
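
The Metropolis-Hastings draw mentioned in Step 4 is not specified further in the text. One common option, shown here purely as an assumed illustration, is a random-walk proposal with a symmetric normal increment. The function name `mh_update_beta`, the step size `step` and the callable `log_target` (returning $\log L(Y_i, \beta) + \log\phi(\beta|\mu_j^*, \tau_j^*)$ up to an additive constant) are hypothetical.

```python
import numpy as np

def mh_update_beta(beta, log_target, step, rng):
    """One random-walk Metropolis-Hastings update for a single beta_i.

    beta       : (d,) current value
    log_target : callable returning the log of the unnormalised target density
    step       : proposal standard deviation (a tuning parameter)
    Returns the new value and a flag indicating whether the proposal was accepted.
    """
    proposal = beta + step * rng.standard_normal(beta.shape)
    log_ratio = log_target(proposal) - log_target(beta)   # symmetric proposal
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True
    return beta, False
```

Since the proposal is symmetric, the acceptance ratio involves only the target density, so the normalising constant of $L(Y_i, \beta)\phi(\beta|\mu_j^*, \tau_j^*)$ is never needed.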



5. Simulation study 

In this section, we present some empirical evidence that shows how the MMNL procedures
perform overall and relative to GML models and finite mixture (FM) of MNL models.
We proceed to the estimation of the choice probabilities based on simulated data. Two
different artificial data sets are generated for the simulation study: data set 1 is produced
for studying non-panel data models, while data set 2 is designed to study models with
panel data. In both cases, we consider a RUM model with three possible responses ($J = 3$)
relative to the utilities $U_1$, $U_2$ and $U_3$,

$$U_1 = x_{11}\beta_1 + x_{12}\beta_2 + \varepsilon_1,$$
$$U_2 = x_{21}\beta_1 + x_{22}\beta_2 + \varepsilon_2,$$
$$U_3 = x_{31}\beta_1 + x_{32}\beta_2 + \varepsilon_3.$$

As for data set 1, we choose $\varepsilon_1, \varepsilon_2, \varepsilon_3 \sim$ standard Gumbel and $\beta = (\beta_1, \beta_2)' \sim 0.5 \times
\delta_{(-5,5)} + 0.5 \times \delta_{(5,-5)}$. For individual $i$, we randomly generate (componentwise) the
covariate vector $x_i = (x_{11}, x_{12}, x_{21}, x_{22}, x_{31}, x_{32})$, independently from a Uniform$(-2, 2)$
distribution. Set $Y_i = j$ if $U_{ij} > U_{il}$, $l \ne j$, for $j = 1, 2, 3$. Repeat this procedure $n$ times
independently to obtain a data set with $(Y_i, x_i)$ for $i = 1, \ldots, n$. As for data set 2, we
assume that there are $n$ individuals, each making $T_i = 10$ choices for $i = 1, \ldots, n$. We
then simulate data using the same model used to generate data set 1. The only change
is that $\beta$ is drawn from the two-component mixture of bivariate normal distributions,
$\beta \sim 0.5 \times N((-5,5)', 2I) + 0.5 \times N((5,-5)', 2I)$, where $I$ is the identity matrix.
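The data-generating process just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the paper: the function names are hypothetical, each individual is assumed to keep one $\beta$ across all of his or her choices, and fresh covariates and Gumbel errors are assumed to be drawn for every choice occasion.

```python
import math
import random

def gumbel(rng):
    """Standard Gumbel draw by inverting the CDF exp(-exp(-e))."""
    u = rng.random()
    while u == 0.0:            # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def draw_beta(rng, panel):
    """Two-point mixture (data set 1) or two-component normal mixture (data set 2)."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    mean = (-5.0 * sign, 5.0 * sign)       # (-5, 5) or (5, -5), each with probability 0.5
    if not panel:
        return mean
    sd = math.sqrt(2.0)                     # covariance 2I means standard deviation sqrt(2)
    return tuple(rng.gauss(m, sd) for m in mean)

def draw_choice(rng, beta):
    """One choice occasion: covariates, utilities and the utility-maximizing response."""
    x = [rng.uniform(-2.0, 2.0) for _ in range(6)]
    utils = [x[2 * j] * beta[0] + x[2 * j + 1] * beta[1] + gumbel(rng) for j in range(3)]
    y = 1 + max(range(3), key=lambda j: utils[j])
    return y, x

def simulate(n, n_choices=1, panel=False, seed=0):
    """Return a list of n individuals, each a list of (y, x) pairs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        beta = draw_beta(rng, panel)
        data.append([draw_choice(rng, beta) for _ in range(n_choices)])
    return data

data_set_1 = simulate(500)
data_set_2 = simulate(100, n_choices=10, panel=True)
```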

Table 1. Simulation results for data set 1 and for data set 2 with
x = (1.0, -0.9, 1.0, 0.2, 1.0, 0.9): the estimates (Est.), the 95% credible intervals (C.I.) and the
root mean square (RMS) values are presented; GML = Gaussian mixed logit, MMNL = mixed
multinomial logit

Data set 1 (non-panel case), n = 500

Model  Probability  True    Est.    95% C.I.          RMS
GML    P({1}|x)     0.4980  0.3203  (0.2907, 0.3501)  0.2258
       P({2}|x)     0.0167  0.3348  (0.3308, 0.3377)
       P({3}|x)     0.4853  0.3449  (0.3191, 0.3715)
MMNL   P({1}|x)     0.4980  0.4856  (0.4748, 0.4945)  0.0137
       P({2}|x)     0.0167  0.0257  (0.0069, 0.0551)
       P({3}|x)     0.4853  0.4886  (0.4615, 0.5057)

Data set 2 (panel case), n = 100, T_i = 10

Model  Probability  True    Est.    95% C.I.          RMS
GML    P({1}|x)     0.4939  0.4585  (0.4476, 0.4685)  0.0266
       P({2}|x)     0.0279  0.0521  (0.0378, 0.0675)
       P({3}|x)     0.4782  0.4894  (0.4717, 0.5061)
MMNL   P({1}|x)     0.4939  0.4586  (0.4495, 0.4670)  0.0265
       P({2}|x)     0.0279  0.0494  (0.0329, 0.0679)
       P({3}|x)     0.4782  0.4920  (0.4705, 0.5107)

We start by applying our procedures to the estimation of choice probabilities $P(\{j\} \mid x)$,
for $j = 1, 2, 3$, based on the set of covariates x = (1.0, -0.9, 1.0, 0.2, 1.0, 0.9). The prior
parameters for the specifications of the Bayesian MMNL models for panel and non-panel
data (pertaining to the explicit models in Sections 3 and 4) are set to be $\alpha = 1$, $\nu_0 = 2$,
$m = (0,0)'$, $S_0 = I$ and $\lambda = 1$. Additionally, we use $N = 100$ for the truncation level in
the blocked Gibbs Algorithms 1 and 2 given in Sections 3 and 4, respectively. A Bayesian
GML model is also estimated for comparison with the same specifications for $\nu_0$, $m$, $S_0$
and $\lambda$. In all cases, we use the estimator (17) based on an initial burn-in of 10,000 cycles
and an additional 10,000 Gibbs cycles ($M = 10{,}000$) for the estimation. In addition, to
measure how good our estimates are, we define the root mean square (RMS) value as

$$\mathrm{RMS} = \sqrt{\frac{1}{J} \sum_{j \in C} \frac{1}{M} \sum_{m=1}^{M} \bigl(P(\{j\} \mid G^{(m)}, x) - P_0(\{j\} \mid x)\bigr)^2},$$

where $P_0(\{j\} \mid x)$ is the choice probability resulting from the data generating process.
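As a small illustration of the RMS measure, the sketch below computes it from a list of posterior draws of the choice probabilities. The function name `rms` and the input layout (`draws[m][j]` for $P(\{j\} \mid G^{(m)}, x)$, `truth[j]` for $P_0(\{j\} \mid x)$) are assumptions made for this example only.

```python
import math

def rms(draws, truth):
    """Root mean square deviation of posterior draws from the true choice probabilities."""
    n_draws = len(draws)
    n_alts = len(truth)
    total = 0.0
    for m in range(n_draws):
        for j in range(n_alts):
            total += (draws[m][j] - truth[j]) ** 2
    return math.sqrt(total / (n_alts * n_draws))

# Toy inputs: two draws, both exactly equal to the truth, give an RMS of zero.
truth = [0.4980, 0.0167, 0.4853]
value_at_truth = rms([truth, truth], truth)
value_worst = rms([[1.0, 0.0]], [0.0, 1.0])
```

Because every draw enters the sum, the measure combines the bias of the posterior mean with the spread of the draws around it.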

Figure 1. MMNL model: autocorrelation functions for the choice probability $P(\{1\} \mid x)$ for data
set 1 (non-panel case, left) and data set 2 (panel case, right), obtained from the posterior sample
of the $\beta$'s for the MMNL model with prior hyperparameter $\lambda = 0.01$ (dashed) and $\lambda = 1$ (dotted).

Simulation results using data set 1 ($n = 500$) and data set 2 ($n = 100$, $T_i = 10$) are
summarized in Table 1, together with RMS values, for both the GML and the MMNL
models. They show that the performance of the nonparametric MMNL model is better
than that of the parametric GML model in the non-panel case, as indicated by a smaller
RMS value and more accurate estimates of choice probabilities, while the GML and
MMNL models display similar performances in the panel case. As expected, the GML
model suffers from misspecification in the non-panel case, while the two-component
mixture of bivariate normals used for generating data set 2 is correctly accounted for by the
GML because of the hyperprior on the parameter $(\mu, \tau)$ we are using. We then get
confirmation that the fit of the MMNL model is as good as that of the GML model. We also
estimated the MMNL model for different choices of the scale parameter $\lambda$; the results
(not reported here) show two different behaviors for the non-panel and the panel
case. As for the non-panel case, RMS values and the estimates remain stable, whereas, in
the panel case, the estimates are more accurate when we decrease $\lambda$, with slightly smaller
RMS values. One interpretation of this increase in accuracy is as follows: a smaller $\lambda$
corresponds to a more diffuse $H$, the prior predictive distribution of $G$. Since $H$ is different
from the distribution used to simulate the $\beta$'s in the data generating process, we obtain
evidence that a diffuse $H$ helps in capturing the true form of the mixing distribution
$G$. Also, note that a smaller $\lambda$ yields a smaller RMS, the latter being a measure of the
combination of the accuracy and the variability of the posterior variates of $P(\{j\} \mid x)$. An
examination of their autocorrelation functions along the chain shows that a smaller $\lambda$
causes a slower mixing of the blocked Gibbs sampler, which increases the component of
variability in the RMS; see Figure 1. The decrease in RMS then shows that such precision
loss is more than balanced by a higher accuracy of the estimate, although one should
also control the convergence properties of the sampler by avoiding taking $\lambda$ too small.
We investigated the sensitivity of the results to the prior parameter $\nu_0$, where a larger
$\nu_0$ corresponds to a more concentrated inverse Wishart distribution on $S_0$. However, we
did not observe substantial differences in the estimation by varying $\nu_0$ and we decided to
set $\nu_0 = 2$ and $S_0 = I$ as a default non-informative choice for these parameters; see Train
(2003), Section 12. The nonparametric prior on $G$ is also dependent on the total mass $\alpha$,
which is positively related to the number of components in the mixture distribution of
the $\beta$'s. Generally, $\alpha = 1$ is considered a default choice for a finite mixture model with
a fixed, but uncertain, number of components. We performed estimation for larger $\alpha$,
obtaining almost identical results: $\alpha = 1$ was, in fact, sufficient for detecting the
two-component mixture we used in generating the data. Although we have not done so, the
blocked Gibbs procedures described in Sections 3 and 4 can be easily extended to place an
additional prior on $\alpha$. Furthermore, the truncation level of $N = 100$ in (15) is sufficiently
large, as we observed almost identical estimation results from runs of the blocked Gibbs
sampler with larger values of $N$.

The second simulation study aims at the verification of the consistency result of Section
2 by estimating the MMNL model for increasing sample sizes for both data set 1 and
data set 2. We also sample $\beta$ variates from their posterior distribution, thus obtaining
an approximate evaluation of the mixing distribution $G$. The prior parameters are set as
$\alpha = 1$, $\nu_0 = 2$, $m = (0,0)'$, $S_0 = I$, $N = 100$ and $\lambda = 1$. Table 2 reports the results by
showing, as expected, a noticeable decrease of RMS for both non-panel and panel data
as the number of observations increases. In addition, Figure 2 reports the histograms of
samples for $\beta_1$ from its marginal posterior distribution against the mixing distribution
used in the data generating process: it shows how the approximation of the true mixing
distribution $G$ improves as more and more data become available.

Finally, we evaluate the performance of the Bayesian MMNL model via a comparison
with the finite mixture (FM) MNL model estimated via the EM algorithm described in
Train (2008), Section 4. The FM MNL model can be considered nonparametric in the
sense that the locations and weights of the mixing distribution $G$ are both assumed to
be parameters. The selection of the number of points in the mixing distribution is based on
the BIC information criterion. We consider 500 Monte Carlo replicates of each of the following 6
situations: data set 1 with sample sizes $n = 50$, 100 and 500; data set 2 with ($n = 10$, $T_i =
10$), ($n = 50$, $T_i = 10$) and ($n = 100$, $T_i = 10$). For a given sample, the posterior estimate
of $P(\{j\} \mid x)$ in equation (17) is computed, based on 6000 Gibbs cycles after a burn-in
period of 4000, for $j = 1, 2, 3$ and for $x$ in a 6-dimensional grid of the hypercube $(-2, 2)^6$
of $5^6$ equally-spaced points. At the same time, we compute the FM MNL estimate of
$P(\{j\} \mid x)$ for $j = 1, 2, 3$, evaluated on the same grid of $x$-points. We call $q(x)$ and $q_0(x)$ the
estimated vector and the true vector of choice probabilities evaluated at $x$, respectively.
We measure the overall error of estimation with the $L_1$-distance $\int |q(x) - q_0(x)|\, dx$,
which corresponds to the (rescaled) distance $d(q, q_0)$ in equation (7), with $M(dx)$ being
Table 2. MMNL model: estimates and the root mean square (RMS) for data set 1 and for data
set 2 with x = (1.0, -0.9, 1.0, 0.2, 1.0, 0.9) and different sample sizes

Data set 1 (non-panel case)

           True    n = 50  n = 100  n = 500
P({1}|x)   0.4980  0.4927  0.5145   0.4856
P({2}|x)   0.0167  0.1046  0.0489   0.0257
P({3}|x)   0.4853  0.4027  0.4366   0.4886
RMS                0.0867  0.0440   0.0137

Data set 2 (panel case), T_i = 10

           True    n = 10  n = 50   n = 100
P({1}|x)   0.4939  0.5956  0.4176   0.4586
P({2}|x)   0.0279  0.0527  0.0562   0.0494
P({3}|x)   0.4782  0.3517  0.5261   0.4920
RMS                0.0977  0.0556   0.0265
Figure 2. MMNL model: histogram estimate of the posterior marginal density of the $\beta_1$'s for data
set 1 (top; panels n = 50, n = 100, n = 500) and for data set 2 (bottom; panels n = 10, n = 50,
n = 100, each with T_i = 10). The horizontal axis runs from -15 to 15. The solid lines represent the
true mixing distribution.



Table 3. Average L1-error from 500 Monte Carlo replicates: FM MNL = finite mixture of
multinomial logit; MMNL = mixed multinomial logit

         Data set 1 (non-panel case)   Data set 2 (panel case), T_i = 10
         n = 50   n = 100  n = 500     n = 10   n = 50   n = 100
FM MNL   0.0521   0.0295   0.0107      0.0891   0.0505   0.0297
MMNL     0.0577   0.0316   0.011       0.0827   0.0467   0.0268

the uniform distribution on the hypercube $(-2, 2)^6$. We compute the $L_1$-error for the
Bayesian MMNL estimator and the FM MNL estimator, then average over the 500 Monte
Carlo replicates. The results are reported in Table 3 and show that the MMNL estimators
outperform the FM MNL estimators in the panel case for all sample sizes, while in the
non-panel case, the situation is reversed, with a similar performance for $n = 500$. Note,
however, that data set 1 is generated exactly from a finite mixture model so that the FM
MNL model is expected to perform well. Overall, the decrease in the average error for
larger sample sizes is a further confirmation of the consistency result of Section 2.
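The grid-based $L_1$-error can be illustrated with the following sketch. Several choices here are assumptions rather than details taken from the paper: the grid uses the five values -2, -1, 0, 1, 2 in each coordinate (endpoints included), the error sums absolute differences over the three alternatives and averages over grid points, and `q_hat` is a made-up estimate obtained by perturbing the mixing weights of data set 1.

```python
import itertools
import math

def choice_probs(x, atoms, weights):
    """Mixed logit probabilities q(x; G) for a discrete mixing distribution G."""
    rows = [x[0:2], x[2:4], x[4:6]]            # covariates of alternatives 1, 2, 3
    q = [0.0, 0.0, 0.0]
    for beta, w in zip(atoms, weights):
        utils = [r[0] * beta[0] + r[1] * beta[1] for r in rows]
        top = max(utils)                        # subtract the max for numerical stability
        expu = [math.exp(u - top) for u in utils]
        total = sum(expu)
        for j in range(3):
            q[j] += w * expu[j] / total
    return q

def average_l1_error(q_hat, q_true, grid):
    """Average over grid points of sum_j |q_hat_j(x) - q_true_j(x)|."""
    total = 0.0
    count = 0
    for x in grid:
        total += sum(abs(a - b) for a, b in zip(q_hat(x), q_true(x)))
        count += 1
    return total / count

points = [-2.0, -1.0, 0.0, 1.0, 2.0]            # assumed grid values per coordinate
grid = list(itertools.product(points, repeat=6))

atoms = [(-5.0, 5.0), (5.0, -5.0)]
q_true = lambda x: choice_probs(x, atoms, [0.5, 0.5])
q_hat = lambda x: choice_probs(x, atoms, [0.45, 0.55])   # hypothetical estimate
err = average_l1_error(q_hat, q_true, grid)
```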



6. Proof of Theorem 1 

Throughout this section, we work with the family of multinomial logistic kernels

$$k_j(x, \beta) = \frac{\exp(x_j'\beta)}{\sum_{l \in C} \exp(x_l'\beta)}, \quad j \in C.$$

With $q_j(x; G)$ denoting the $j$th element of the vector $q(x; G)$, we have that $q_j(x; G) =
\int_{\mathbb{R}^d} k_j(x, \beta)\, G(d\beta)$. Note that $q_y(x; G_0)$ is the joint density of $(Y, X)$ with respect to the
counting measure on the integer set $C$ and the measure $M(dx)$ on $\mathcal{X}$.
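To make the kernel and the mixture representation concrete, the sketch below evaluates $q(x; G)$ for a discrete mixing distribution. The function names are hypothetical. As a sanity check, it uses the two-point mixing distribution of data set 1 and the covariate vector x = (1.0, -0.9, 1.0, 0.2, 1.0, 0.9) from Section 5; the result should agree with the "True" column of Table 1 for data set 1 up to rounding.

```python
import math

def mnl_kernel(x, beta):
    """Return [k_1(x, beta), ..., k_J(x, beta)], where x[j] is the covariate row of alternative j."""
    utils = [sum(a * b for a, b in zip(row, beta)) for row in x]
    top = max(utils)                         # subtract the max for numerical stability
    expu = [math.exp(u - top) for u in utils]
    total = sum(expu)
    return [e / total for e in expu]

def choice_probs(x, atoms, weights):
    """q(x; G) for the discrete G that puts mass weights[k] on atoms[k]."""
    q = [0.0] * len(x)
    for beta, w in zip(atoms, weights):
        k = mnl_kernel(x, beta)
        for j in range(len(x)):
            q[j] += w * k[j]
    return q

x = [(1.0, -0.9), (1.0, 0.2), (1.0, 0.9)]   # rows (x_j1, x_j2) for j = 1, 2, 3
atoms = [(-5.0, 5.0), (5.0, -5.0)]
q0 = choice_probs(x, atoms, [0.5, 0.5])
```

For a continuous $G$, the same function can be used with Monte Carlo draws of $\beta$ as atoms and equal weights, which is how the integral is typically approximated.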

For the proof of Theorem 1, the following lemma is essential, stating that on the space
$\mathbb{P}$, the weak topology and the topology induced by the $L_1$-distance $d$ defined in (7) are
equivalent.

Lemma 1. Let $d_w$ be any distance that metrizes the weak topology on $\mathbb{P}$ and $(G_n)_{n \ge 1}$
be a sequence in $\mathbb{P}$. Then $d_w(G_n, G_0) \to 0$ if and only if $d(q(\cdot; G_n), q(\cdot; G_0)) \to 0$.

Proof. For the "if" part, it is sufficient that $d_w(G_n, G_0) \to 0$ implies that $\int_{\mathcal{X}} |q_j(x; G_n) -
q_j(x; G_0)|\, M(dx) \to 0$ for an arbitrary $j \in C$. The latter is a consequence of the definition
of weak convergence and an application of Scheffé's theorem since $k_j(x, \beta)$ is bounded and
continuous in $\beta$ for each $x \in \mathcal{X}$. To show the converse, we prove that $G$ being distant from
$G_0$ in the weak topology implies that $q(\cdot; G)$ is distant from $q(\cdot; G_0)$ in the $L_1$-distance
$d$. Define a weak neighborhood of $G_0$ as

$$V = \Big\{ G: \Big| \int_{\mathbb{R}^d} \int_{\mathcal{X}} k_j(x, \beta)\, M(dx)\, G(d\beta) - \int_{\mathbb{R}^d} \int_{\mathcal{X}} k_j(x, \beta)\, M(dx)\, G_0(d\beta) \Big| < \delta,\ j \in C \Big\}.$$

Since $\int_{\mathcal{X}} k_j(x, \beta)\, M(dx)$ is a bounded continuous function on $\mathbb{R}^d$ for each $j$, $G \in V^c$
implies that $d_w(G, G_0) > \delta$. Based on the inequalities

$$d(q(\cdot; G), q(\cdot; G_0)) \ge \max_j \int_{\mathcal{X}} |q_j(x; G) - q_j(x; G_0)|\, M(dx)$$
$$\ge \max_j \Big| \int_{\mathcal{X}} \int_{\mathbb{R}^d} k_j(x, \beta)\, G(d\beta)\, M(dx) - \int_{\mathcal{X}} \int_{\mathbb{R}^d} k_j(x, \beta)\, G_0(d\beta)\, M(dx) \Big|$$

and an application of Fubini's theorem, it follows that, for any $\varepsilon < \delta$ and any $G \in V^c$,
$d(q(\cdot; G), q(\cdot; G_0)) > \varepsilon$. The proof is then complete. □
Remark 1. Lemma 1 has two important consequences: (a) both $\mathcal{Q}$ and $\mathbb{P}$ are separable
spaces under the metric $d$; (b) the statement of Theorem 1 is equivalent to saying that
$\mathcal{P}_n$ accumulates all probability mass in a weak neighborhood of $G_0$.

Define $\Lambda_n(G) = \prod_{i=1}^{n} q_{Y_i}(X_i; G)/q_{Y_i}(X_i; G_0)$ so that the posterior distribution of $G$
can be written as

$$\mathcal{P}_n(A) = \frac{\int_A \Lambda_n(G)\, \mathcal{P}(dG)}{\int_{\mathbb{P}} \Lambda_n(G)\, \mathcal{P}(dG)}. \qquad (19)$$

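We now take $A = \{G: d(q(\cdot; G), q(\cdot; G_0)) > \varepsilon\}$ and will, as is usual in the Bayesian
consistency literature, separately consider the numerator and the denominator of (19). To
this end, define $I_n = \int_{\mathbb{P}} \Lambda_n(G)\, \mathcal{P}(dG)$. Relying on the separability of $\mathbb{P}$ under the topology
induced by $d$ (see Remark 1), for any $\eta > 0$, we can cover $A$ with a countable union of
disjoint sets $A_j$ such that

$$A_j \subseteq A_j^* = \{G: d(q(\cdot; G), q(\cdot; G_j)) < \eta\} \qquad (20)$$

and $\{G_j\}_{j \ge 1}$ is a countable set in $\mathbb{P}$ such that $d(q(\cdot; G_j), q(\cdot; G_0)) > \varepsilon$ for any $j$. Consider
the fact that

$$\mathcal{P}_n(A) = \sum_{j \ge 1} \mathcal{P}_n(A_j) \le \sum_{j \ge 1} \sqrt{\mathcal{P}_n(A_j)} = \frac{1}{\sqrt{I_n}} \sum_{j \ge 1} \sqrt{\int_{A_j} \Lambda_n(G)\, \mathcal{P}(dG)}.$$

Hence, Theorem 1 holds if we prove that, for all large $n$,

$$\forall c > 0, \quad I_n > \exp(-nc) \quad \text{a.s.} \qquad (21)$$

$$\exists b > 0: \quad \sum_{j \ge 1} \sqrt{\int_{A_j} \Lambda_n(G)\, \mathcal{P}(dG)} < \exp(-nb) \quad \text{a.s.} \qquad (22)$$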

As for (21), consider the Kullback-Leibler (KL) support condition of $\mathcal{P}$ defined by

$$\mathcal{P}\Big\{ G: \int_{\mathcal{X}} K(G_0, G \mid x)\, M(dx) < \varepsilon \Big\} > 0 \quad \forall \varepsilon > 0, \qquad (23)$$

where $K(G_0, G \mid x) = \sum_{j \in C} q_j(x; G_0) \log[q_j(x; G_0)/q_j(x; G)]$. If $\mathcal{P}$ satisfies condition (23),
then (21) holds. To see this, it is sufficient to note that the KL divergence of $q_y(x; G)$
from $q_y(x; G_0)$ with respect to the measure $M(dx)$ on $\mathcal{X}$ and the counting measure on
$C$ is given by $\int_{\mathcal{X}} K(G_0, G \mid x)\, M(dx)$. By the compactness of $\mathcal{X}$, the law of large numbers
then leads to

$$\frac{1}{n} \sum_{i=1}^{n} \log \frac{q_{Y_i}(X_i; G_0)}{q_{Y_i}(X_i; G)} \to \int_{\mathcal{X}} K(G_0, G \mid x)\, M(dx) \quad \text{a.s.}$$

The result in (21) then follows from standard arguments; see, for example, Wasserman
(1998). Lemma 2 below states that (23) is satisfied under the hypotheses of Theorem 1.

Lemma 2. If $G_0$ lies in the weak support of $\mathcal{P}$ and condition (i) of Theorem 1 holds,
then $G_0$ is in the KL support of $\mathcal{P}$, according to (23).

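Proof. It is sufficient to show that for any $j \in C$ and any $\eta < 1$, there exists a $\delta$ such
that $|q_j(x; G)/q_j(x; G_0) - 1| < \eta$ whenever $G$ is in $W_\delta$, a $\delta$-weak neighborhood of $G_0$. In
fact, this implies that

$$\int_{\mathcal{X}} q_j(x; G_0) \log \frac{q_j(x; G_0)}{q_j(x; G)}\, M(dx) \le \int_{\mathcal{X}} q_j(x; G_0) \Big| \frac{q_j(x; G_0)}{q_j(x; G)} - 1 \Big|\, M(dx) \le \int_{\mathcal{X}} q_j(x; G_0) \Big( \frac{\eta}{1 - \eta} \Big) M(dx) \le \frac{\eta}{1 - \eta},$$

which, in turn, leads to the thesis, by the arbitrary nature of $j$.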

Let $c = \inf_{x \in \mathcal{X}} q_j(x; G_0)$, which is positive by condition (i) of Theorem 1, and assume
that $G \in W_\delta$ for a $\delta$ that will be determined later. Note that, for any $\rho > 0$, one can
set $M_\rho > 0$ such that $G_0\{\beta: |\beta| > M_\rho - \delta\} < \rho$. Then, using the Prokhorov metric,
$G \in W_\delta$ implies that $G\{\beta: |\beta| > M_\rho\} < \rho + \delta$. Also, note that the family of functions
$\{k_j(x, \beta),\ x \in \mathcal{X}\}$, as $\beta$ varies in the compact set $\{|\beta| \le M_\rho\}$, is uniformly equicontinuous.
By an application of the Arzelà-Ascoli theorem, we know that, given a $\gamma > 0$, there exist
finitely many points $x_1, \ldots, x_m$ such that, for any $x \in \mathcal{X}$, there is an index $i$ such that

$$\sup_{|\beta| \le M_\rho} |k_j(x, \beta) - k_j(x_i, \beta)| < \gamma. \qquad (24)$$

For an arbitrary $x \in \mathcal{X}$, choose the appropriate $x_i$ such that (24) holds, so that

$$\Big| \frac{q_j(x; G)}{q_j(x; G_0)} - 1 \Big| \le \frac{1}{c} \Big\{ \Big| \int k_j(x_i, \beta)\, G(d\beta) - \int k_j(x_i, \beta)\, G_0(d\beta) \Big| + \int |k_j(x, \beta) - k_j(x_i, \beta)|\, G(d\beta) + \int |k_j(x, \beta) - k_j(x_i, \beta)|\, G_0(d\beta) \Big\} = \frac{1}{c} (I_1 + I_2 + I_3).$$

We have that $G \in W_\delta$ implies $I_1 < \delta$. As for $I_2$, we have

$$I_2 = \int_{|\beta| \le M_\rho} |k_j(x, \beta) - k_j(x_i, \beta)|\, G(d\beta) + \int_{|\beta| > M_\rho} |k_j(x, \beta) - k_j(x_i, \beta)|\, G(d\beta) \le \gamma + 2 G\{\beta: |\beta| > M_\rho\} < \gamma + 2(\rho + \delta).$$

Similar arguments lead to $I_3 < \gamma + 2\rho$. Finally, we get

$$\Big| \frac{q_j(x; G)}{q_j(x; G_0)} - 1 \Big| \le \frac{3\delta + 2\gamma + 4\rho}{c}$$

so that, for given $\eta < 1$, it is always possible to choose $\delta$, $\rho$ (by tightness of $G_0$) and $\gamma$
(by the Arzelà-Ascoli theorem) small enough such that the right-hand side in the last
inequality is smaller than $\eta$. The proof is then complete. □

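We now aim to show that (22) holds under the hypotheses of Theorem 1, by extending
the method set forth by Walker (2004) for strong consistency. In order to simplify the
notation, let $\Lambda_{nj} = \int_{A_j} \Lambda_n(G)\, \mathcal{P}(dG)$, where $(A_j)_{j \ge 1}$ is the covering of $A$ in (20). The
following identity is the key:

$$\Lambda_{n+1,j}/\Lambda_{nj} = q^{nA_j}_{Y_{n+1}}(X_{n+1}) / q_{Y_{n+1}}(X_{n+1}; G_0), \qquad (25)$$

where $q_l^{nA_j}(X_{n+1}) = \int_{\mathbb{P}} q_l(X_{n+1}; G)\, \mathcal{P}_{nA_j}(dG)$, $l \in C$, and $\mathcal{P}_{nA_j}$ is the posterior
distribution restricted, and normalized, to the set $A_j$. Note that (25) includes the case of $n = 0$
and $\Lambda_{0j} = \mathcal{P}(A_j)$. By using conditional expectation, we have that

$$E\big[\sqrt{\Lambda_{n+1,j}} \mid (Y_1, X_1), \ldots, (Y_n, X_n), X_{n+1}\big] = \sqrt{\Lambda_{nj}} \sum_{l \in C} \sqrt{q_l^{nA_j}(X_{n+1})\, q_l(X_{n+1}; G_0)}$$
$$= \sqrt{\Lambda_{nj}}\, \big(1 - h[q^{nA_j}(X_{n+1}), q(X_{n+1}; G_0)]\big),$$

where $q^{nA_j}(X_{n+1}) = [q_1^{nA_j}(X_{n+1}), \ldots, q_J^{nA_j}(X_{n+1})]$ and, for $q_1, q_2 \in \Delta$,

$$h(q_1, q_2) = 1 - \sum_{j=1}^{J} \sqrt{q_{1j}\, q_{2j}}.$$

Note that $h(q_1, q_2)$ is a variation of the Hellinger distance $\big\{\sum_{j=1}^{J} (\sqrt{q_{1j}} - \sqrt{q_{2j}})^2\big\}^{1/2}$ on
$\Delta$ and that $h(q_1, q_2) \le 1$. By taking the conditional expectation with respect to
$(Y_1, X_1), \ldots, (Y_n, X_n)$ only, we get the following identity:

$$E\big\{\sqrt{\Lambda_{n+1,j}} \mid (Y_1, X_1), \ldots, (Y_n, X_n)\big\} = \sqrt{\Lambda_{nj}} \Big(1 - \int_{\mathcal{X}} h[q^{nA_j}(x), q(x; G_0)]\, M(dx)\Big). \qquad (26)$$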

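Since the Hellinger distance and the Euclidean distance are equivalent metrics in $\Delta$, it
can be proven that, for $(q_n)_{n \ge 1} \in \mathcal{Q}$ and $q_0 \in \mathcal{Q}$,

$$\int_{\mathcal{X}} h[q_n(x), q_0(x)]\, M(dx) \to 0 \quad \text{if and only if} \quad d(q_n, q_0) \to 0. \qquad (27)$$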
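The relation between $h$ and the Hellinger distance is easy to check numerically: for probability vectors, the squared Hellinger distance equals $2h$. The sketch below verifies this on two probability vectors taken from Table 1 (the true choice probabilities and the GML estimates for data set 1); the function names are chosen only for this example.

```python
import math

def h(q1, q2):
    """h(q1, q2) = 1 - sum_j sqrt(q1_j * q2_j) for probability vectors q1, q2."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(q1, q2))

def hellinger(q1, q2):
    """Hellinger distance: square root of sum_j (sqrt(q1_j) - sqrt(q2_j))^2."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(q1, q2)))

q_true = [0.4980, 0.0167, 0.4853]
q_gml = [0.3203, 0.3348, 0.3449]

gap = abs(hellinger(q_true, q_gml) ** 2 - 2.0 * h(q_true, q_gml))
h_disjoint = h([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # vectors with disjoint support
```

Expanding the square gives $\sum_j q_{1j} + \sum_j q_{2j} - 2\sum_j \sqrt{q_{1j} q_{2j}} = 2h(q_1, q_2)$, so $h$ vanishes exactly when the two vectors coincide and reaches its maximum of one when they have disjoint support.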

The equivalence in (27) can be used to show that $\int_{\mathcal{X}} h[q^{nA_j}(x), q(x; G_0)]\, M(dx)$ is
bounded away from zero. In fact, take $G_j$ defined in (20) and note that, by the triangle
inequality,

$$\int_{\mathcal{X}} h[q^{nA_j}(x), q(x; G_0)]\, M(dx) \ge \int_{\mathcal{X}} h[q(x; G_j), q(x; G_0)]\, M(dx) - \int_{\mathcal{X}} h[q^{nA_j}(x), q(x; G_j)]\, M(dx).$$
Since $d(q(\cdot; G_j), q(\cdot; G_0)) > \varepsilon$, (27) ensures the existence of a positive constant, say $\varepsilon_2$,
such that $\int_{\mathcal{X}} h[q(x; G_j), q(x; G_0)]\, M(dx) > \varepsilon_2$. Now, choose $\eta$ in (20) such that, for each
$G \in A_j$, $\int_{\mathcal{X}} h[q(x; G), q(x; G_j)]\, M(dx) < \varepsilon_2$, where we have again used (27). Since $q^{nA_j}(x)$
does not correspond exactly to a particular $G \in A_j$, we use the convexity of the distance
$h[q(x; G), q(x; G_j)]$ in its first argument to show that $\int_{\mathcal{X}} h[q^{nA_j}(x), q(x; G_j)]\, M(dx) < \varepsilon_2$.
Note that, in fact, by Jensen's inequality,

$$\int_{\mathcal{X}} h[q^{nA_j}(x), q(x; G_j)]\, M(dx) = \int_{\mathcal{X}} \Big(1 - \sum_{l \in C} \sqrt{\int_{\mathbb{P}} q_l(x; G)\, \mathcal{P}_{nA_j}(dG)\; q_l(x; G_j)}\Big) M(dx)$$
$$\le \int_{\mathbb{P}} \int_{\mathcal{X}} h[q(x; G), q(x; G_j)]\, M(dx)\, \mathcal{P}_{nA_j}(dG) < \varepsilon_2.$$

Hence, there exists an $\varepsilon_3 > 0$ such that $\int_{\mathcal{X}} h[q^{nA_j}(x), q(x; G_0)]\, M(dx) > \varepsilon_3$.
From (26), it now follows that

$$E\big\{\sqrt{\Lambda_{nj}}\big\} \le (1 - \varepsilon_3)^n \sqrt{\mathcal{P}(A_j)}$$

and an application of Markov's inequality leads to

$$P\Big\{ \sum_{j \ge 1} \sqrt{\Lambda_{nj}} > \exp(-nb) \Big\} \le \exp(nb)\, (1 - \varepsilon_3)^n \sum_{j \ge 1} \sqrt{\mathcal{P}(A_j)}.$$

Therefore, (22) holds for any $b < -\log(1 - \varepsilon_3)$ from an application of the Borel-Cantelli
lemma, provided that the following summability condition is satisfied:

$$\sum_{j \ge 1} \sqrt{\mathcal{P}(A_j)} < +\infty. \qquad (28)$$

Lemma 3 below shows that $\mathcal{P}$ satisfies condition (28) under the stated hypotheses and,
in turn, completes the proof of Theorem 1.



Lemma 3. Let $H$ be the prior predictive distribution of $\mathcal{P}$ and assume that condition
(ii) of Theorem 1 holds. Then (28) is verified.

Proof. The proof follows along the lines of arguments used by Lijoi, Prünster and Walker
(2005). Take $\delta$ to be any positive number in $(0, 1)$ and $(a_n)_{n \ge 1}$ any increasing sequence
of positive numbers such that $a_n \to +\infty$. Also, let $a_0 = 0$. Define $C_n = \{\beta: |\beta| \le a_n\}$ and
consider the family of subsets of $\mathbb{P}$ defined by

$$B_{a_n,\delta} = \{G: G(C_n) \ge 1 - \delta,\ G(C_{n-1}) < 1 - \delta\} \qquad (29)$$

for each $n \ge 1$. These sets are pairwise disjoint and $\bigcup_n B_{a_n,\delta} = \mathbb{P}$. For the moment, let
us assume that the metric entropy of $B_{a_n,\delta}$ with respect to the distance $d$ is uniformly
bounded in $n$, that is, the number of $\eta$-balls in the distance $d$ that covers $B_{a_n,\delta}$ is finite
for any $n$. Summability in (28) is then implied by

$$\sum_{n \ge 1} \sqrt{\mathcal{P}(B_{a_n,\delta})} < +\infty. \qquad (30)$$

In order to prove (30), note that $B_{a_n,\delta} \subseteq \{G: G(C_{n-1}^c) > \delta'\}$ for some $\delta' > \delta$. An
application of Markov's inequality leads to $\mathcal{P}(B_{a_n,\delta}) \le (1/\delta')\, H(C_{n-1}^c)$, hence (30) is implied
by $\sum_{n \ge 1} \sqrt{H(C_{n-1}^c)} < +\infty$. Next, we have that

$$\int_{\mathbb{R}^d} |\beta|\, H(d\beta) = \sum_{n \ge 1} \int_{C_n \setminus C_{n-1}} |\beta|\, H(d\beta) \ge \sum_{n \ge 1} a_{n-1} [H(C_{n-1}^c) - H(C_n^c)],$$

by a second application of Markov's inequality, so that condition (ii) of Theorem 1 ensures
that $\sum_{n \ge 1} a_{n-1} [H(C_{n-1}^c) - H(C_n^c)] < +\infty$. If we now take $a_n \sim n^2$, it is easy to see that
$H(C_{n-1}^c) = o(n^{-(2+r)})$ for some $r > 0$. For example,

$$\sum_{n \ge 1} (n - 1)^2 [H(C_{n-1}^c) - H(C_n^c)] = \sum_{n \ge 1} (2n - 1)\, H(C_n^c).$$

This, in turn, ensures the convergence of $\sum_{n \ge 1} H(C_{n-1}^c)^{\alpha}$ for any $\alpha$ such that $(2 + r)^{-1} <
\alpha < 1$, which includes the case $\alpha = 1/2$. Condition (30) is then verified.
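The identity displayed in the example above is a summation by parts. A short derivation for a finite upper limit $L$ follows, writing $T_n$ for the tail mass $H(C_n^c)$; both $T_n$ and $L$ are shorthand introduced only for this sketch, and passing to the limit assumes that the boundary term $(L-1)^2 T_L$ vanishes as $L \to \infty$.

```latex
\begin{align*}
\sum_{n=1}^{L} (n-1)^2 \,(T_{n-1} - T_n)
  &= \sum_{n=0}^{L-1} n^2\, T_n \;-\; \sum_{n=1}^{L} (n-1)^2\, T_n \\
  &= \sum_{n=1}^{L-1} \bigl[n^2 - (n-1)^2\bigr]\, T_n \;-\; (L-1)^2\, T_L \\
  &= \sum_{n=1}^{L-1} (2n-1)\, T_n \;-\; (L-1)^2\, T_L .
\end{align*}
```

The first line shifts the index in the sum involving $T_{n-1}$; the $n = 0$ term drops out because of the factor $n^2$.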

In order to complete the proof, it remains to show that the metric entropy of $B_{a_n,\delta}$
with respect to the distance $d$ is uniformly bounded in $n$. It is actually sufficient to reason
in terms of the distance over $\mathbb{P}$ induced by

$$d_j(q_1, q_2) = \int_{\mathcal{X}} |q_{1j}(x) - q_{2j}(x)|\, M(dx)$$

for an arbitrary $j \in C$ since $\max_j d_j(q_1, q_2) \le d(q_1, q_2) \le J \max_j d_j(q_1, q_2)$. Let $\mathcal{G}$ be a
set in $\mathcal{Q}$ and, for $\delta > 0$, denote by $J(\delta, \mathcal{G})$ the metric entropy of $\mathcal{G}$ with respect to $d_j$,
that is, the logarithm of the minimum of all $k$ such that there exist $q_1, \ldots, q_k \in \mathcal{Q}$ with
the property that $\forall q \in \mathcal{G}$ there exists an $i$ such that $d_j(q, q_i) < \delta$. The result is then
stated as follows: for $\mathcal{G}_{a_n,\delta} = \{q(x; G): G \in B_{a_n,\delta}\}$, there exists an $M_\delta < +\infty$ depending
only on $\delta$ such that, for any $n$,

$$J(\delta, \mathcal{G}_{a_n,\delta}) \le M_\delta. \qquad (31)$$

The proof of (31) consists of a sequence of three steps.

Step (1). Define $C_a = \{\beta: |\beta| \le a\}$ and $\mathcal{G}_a = \{q(x; G): G(C_a) = 1\}$. Then

$$J(2\delta, \mathcal{G}_a) \le \Big(\frac{4aK}{\delta} + 1\Big)^d \Big(1 + \log\frac{1 + \delta}{\delta}\Big), \qquad (32)$$

where $K$ is a constant that depends on the total volume of the space $\mathcal{X}$. It is easy to
show that, for any $j \in C$, the kernel $k_j(x, \beta)$ is a Lipschitz function in $\beta$ with Lipschitz
constant $K_x = \max_{i \le J} \{|x_j - x_i|\}$. Hence,

$$\int_{\mathcal{X}} |k_j(x, \beta_1) - k_j(x, \beta_2)|\, M(dx) \le K |\beta_1 - \beta_2|,$$

where $K = \sup_{x \in \mathcal{X}} K_x < +\infty$. Given $\delta$, let $N$ be the smallest integer greater than $4aK/\delta$
and cover $C_a$ with a set of balls $E_i$ of radius $2a/N$ so that, for any $\beta_1, \beta_2 \in E_i$, $|\beta_1 - \beta_2| \le
4a/N$. This leads to $\int_{\mathcal{X}} |k_j(x, \beta_1) - k_j(x, \beta_2)|\, M(dx) \le \delta$. The number of balls necessary
to cover $C_a$ is then smaller than $N^d$. Using arguments similar to those used in Ghosal,
Ghosh and Ramamoorthi (1999), Lemma 1, it can be shown that $J(2\delta, \mathcal{G}_a) \le N^d (1 +
\log[(1 + \delta)/\delta])$, from which (32) follows.

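Step (2). Define $\mathcal{G}_{a,\delta} = \{q(x; G): G(C_a) \ge 1 - \delta\}$. Then

$$J(3\delta, \mathcal{G}_{a,\delta}) \le K_\delta\, a^d \qquad (33)$$

for a constant $K_\delta$ depending on $\delta$. To see this, take $q(x; G) \in \mathcal{G}_{a,\delta}$ and denote by $G^*$ the
probability measure in $\mathbb{P}$ defined by $G^*(A) = G(A \cap C_a)/G(C_a)$ so that $q(x; G^*)$ belongs
to $\mathcal{G}_a$. It is easy to verify that $d_j(q(\cdot; G^*), q(\cdot; G)) \le 2\delta$. It follows that $J(3\delta, \mathcal{G}_{a,\delta}) \le
J(\delta, \mathcal{G}_a)$, from which (33) follows.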

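Step (3). We follow here a technique used by Lijoi, Prünster and Walker (2005), Section
3.2. For the sequence $(a_n)_{n \ge 1}$ introduced before, define

$$\mathcal{G}^U_{a_n,\delta} = \{q(x; G): G(C_n) \ge 1 - \delta\} \quad \text{and} \quad \mathcal{G}^L_{a_n,\delta} = \{q(x; G): G(C_n) < 1 - \delta\}.$$

By construction, $\mathcal{G}_{a_n,\delta} \subseteq \mathcal{G}^U_{a_n,\delta}$ and $\mathcal{G}_{a_n,\delta} \subseteq \mathcal{G}^L_{a_{n-1},\delta}$. Moreover, $\mathcal{G}^L_{a_n,\delta} \downarrow \emptyset$ as $n$
increases to $+\infty$, thus, for any $\eta > 0$, there exists an integer $n_0$ such that, for any $n \ge n_0$,
$J(\eta, \mathcal{G}^L_{a_n,\delta}) \le J(\eta, \mathcal{G}^U_{a_{n_0},\delta})$. By (33), it follows that

$$J(\eta, \mathcal{G}_{a_n,\delta}) \le K_\delta\, a_{n_0}^d, \qquad (34)$$

for any $n \ge n_0$, but, since $\mathcal{G}_{a_n,\delta} \subseteq \mathcal{G}^U_{a_n,\delta}$ and $\mathcal{G}^U_{a_n,\delta} \uparrow$, (34) is also true for any $n < n_0$.
Result (31) is then verified by setting $M_\delta = K_\delta\, a_{n_0}^d$. □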